PhD in English from 2013, plus freelance work
Data journalist at The Times and Sunday Times since 2016
Data advisor at Global Witness from next month
Cyber-crime and the dark net
Transparency and open data
Bringing innovative techniques to data journalism
R programming language
Tidyverse family of packages
Elasticsearch and Kibana
Access
Amalgamation (OK, this one’s not great)
Analysis
People talk a lot about creating new data
Another way to think of it: accessing hidden data
Examples: web scraping, using APIs, getting data from PDFs and Word documents, working with data bigger than Excel can handle
2 + 2 = 5
When combined, data is more than the sum of its parts
Examples: joins and fuzzy matching, working with weird file formats, tidying data
How can programming help us see stories?
Visualisation is important, but not the be-all and end-all
Examples: geospatial analysis, time series analysis, statistical analysis, search
Getting information from a website into structured form
90% of scraping jobs for stories follow this format:
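# rvest does the scraping; the tidyverse supplies the %>% pipe
library(tidyverse)
library(rvest)

# Download the page, select elements by CSS class, extract their text
read_html("https://www.thetimes.co.uk/") %>%
  html_nodes(".Item-headline") %>%
  html_text()
## [1] "Clare Foges"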
## [2] "Times Daily Quiz"
## [3] "Top trumps"
## [4] "Yasmin Green"
## [5] "Three dead and several hurt in Dutch tram shooting"
## [6] "May woos Brexiteers but still faces defeat"
## [7] "Deal or no deal? How the numbers add up"
## [8] "The DUP needs a deal just as much as May"
## [9] "What happens if the PM’s deal is rejected again?"
## [10] "Duty‑free drinking ban for air passengers"
## [11] "Christchurch suspect ‘not mentally unstable’"
## [12] "Three teenagers die in St Patrick’s Day disco stampede"
## [13] "Cyclist critical after being kicked off bike"
## [14] "Sniff a lemon or squeeze your ear to pass exams, pupils told"
## [15] "Child sex abuse survivors claim ministers are still punishing them"
## [16] "Police fail to protect Rotherham victim from troll"
## [17] "No junk food ads on Facebook until 9pm in plan to fight obesity"
## [18] "What did happen to the Likely Lads? Now we know"
## [19] "Pray for us, pleads father of girl, 4, fighting for her life"
## [20] "Suspect ‘spent time in Britain during his tour of Europe’"
## [21] "We shared house with polite loner, say couple"
## [22] "Prime minister and mourners gather in song"
## [23] "Live broadcast of massacre prompts calls for regulator"
## [24] "Polls show Islamophobia gaining a foothold"
## [25] "Call to remember heroes, not the villain"
## [26] "Man, 50, held as police link Tesco stabbing to far right"
## [27] "Legal aide who gave bad advice told to pay £260,000"
## [28] "Deadlock derails bid for greater local democracy"
## [29] "Where EU leaders stand on Article 50 extension"
## [30] "Blocking deal will alienate voters, Fox tells hardline MPs"
## [31] "Corbyn might back Leave if there is a second referendum"
## [32] "Hammond denies bribing DUP with cash for Northern Ireland"
## [33] "Church of England backs tea parties to heal divide"
## [34] "This honeymoon could last a while..."
## [35] "Lab closures ‘would delay cancer tests’"
## [36] "Month’s rain in one day floods homes"
## [37] "One man and his drone could replace sheepdogs"
## [38] "Serial adventurer to take on world via gyrocopter"
## [39] "Prostate cancer is ‘less deadly than thought’"
## [40] "Radio rivals snub offer to join BBC Sounds app"
## [41] "Social media firms face demands for mental health tax"
## [42] "Faceless foreigners stash £100bn in UK property"
## [43] "Second Love Island contestant dies in a year"
## [44] "Ewe can’t hurry love: sperm from 1968 produces lambs"
## [45] "I’m with Willy Wonka and Rees-Mogg against the health police"
## [46] "I had two weeks away from Westminster. It looked horrifying"
## [47] "The UK must continue to lead the world order as a force for good"
## [48] "Can it really be second time lucky at the polls for Corbyn?"
## [49] "News in pictures"
## [50] "We’ve let loose the destructive genies of tech"
## [51] "Brexiteers ignore the US-Irish lobby at their peril"
## [52] "Great book characters demand a new chapter"
## [53] "Footy fans are in a league of their own for superstitions"
## [54] "British science needs to remain open to the world"
## [55] "Beyond Hatred and Fanaticism"
## [56] "Fat Chance"
## [57] "Woolly Thinking"
## [58] "Brexit and the vote for May’s deal"
## [59] "Nature notes"
## [60] "Birthdays today"
## [61] "‘White Obama’ grips Democrat primary race"
## [62] "Erdogan shows mosque attack footage at election rally"
## [63] "Flight data shows Boeing jets suffered same failures"
## [64] "Macron under fire as yellow vest violence wrecks Champs Elysées"
## [65] "Berlin’s new ice princess makes her public debut"
## [66] "Hamas rocket launch was ‘a Monty Python’ mistake"
## [67] "Former Putin adviser had injury consistent with being strangled"
## [68] "Russia ‘will sow dissent during MEP elections’"
## [69] "Outrage at huge adverts on Venice churches"
## [70] "Iran jails US veteran for 10 years"
## [71] "Miners plunder Venezuelan forests for gold"
## [72] "Murder inquiry over model Berlusconi says he never met"
## [73] "NY mob boss shot ‘to avenge broken heart’"
## [74] "Ex-Nazis ‘gave Mossad the edge in Six-Day War’"
## [75] "Students held after online game banned"
## [76] "Game that celebrates winter ennui strikes a chord with Muscovites"
## [77] "German banks see way clear for €25bn merger"
## [78] "Investment lost to Brexit ‘may never come back’"
## [79] "JD Sports snaps up Footasylum for £90m"
## [80] "Apple relying on Hollywood royalty to win Netflix’s crown"
## [81] "Interserve was handed work despite crisis"
## [82] "Nervous households hold back on major purchases"
## [83] "Let’s not push minimum wage too far without studying the evidence"
## [84] "Chief executives can’t expect huge rewards just for being lucky"
## [85] "Intu malls on Canada shopping list"
## [86] "Iron ore shortage lifts mining stocks"
## [87] "Sorrell’s S4 Capital reports 30% revenue growth"
## [88] "Britain warned over legality of border plan"
## [89] "Anxious small companies seek port in Brexit storm"
## [90] "Pre-pack an option for Debenhams"
## [91] "Daily Mail owner joins call to break up Google and Facebook"
## [92] "Zero tariffs ‘will save £10bn’ under no-deal"
## [93] "House price rises low for springtime"
## [94] "Glencore offices raided in India over alleged price-fixing cartel"
## [95] "Moving abroad in mind for car parts suppliers"
## [96] "Footasylum board throw in the towel"
## [97] "Fast-growing firms confident of more growth"
## [98] "March of budget gyms shows no sign of fatigue"
## [99] "Late payers may answer to their audit committees"
## [100] "Green light for trio to sue classic car dealer"
## [101] "The week ahead"
## [102] "Farrell can win the World Cup . . . he can also lose it"
## [103] "Team Sky saved by Britain’s wealthiest man"
## [104] "Who would make a Lions XV if it was picked today?"
## [105] "Cult heroes: Cambridge striker who wore odd boots and punched his manager"
## [106] "Jones wants psychologist to rid England of the ghosts of 2015"
## [107] "‘It used to be cornflakes and toast for breakfast. Now it’s masala dosa’"
## [108] "England must turn to Itoje and Gatland for leadership roles"
## [109] "Why 48-team World Cup is a good idea"
## [110] "Calls for calm amid Elland Road angst"
## [111] "Milner helps Liverpool overcome test of nerve"
## [112] "Silva’s salvation piles all the pressure on Sarri"
## [113] "Martin’s late mistake punished as Brighton reach semi-finals"
## [114] "Ronaldo charged by Uefa after mocking Simeone’s ‘obscene’ celebration"
## [115] "Gatland’s final Wales team could be best yet"
## [116] "Jones lauded as first among Welsh greats"
## [117] "Writers’ verdict: the good, the bad – and Alun Wyn"
## [118] "Boy genius Russell adds maturity to his box of tricks"
## [119] "World Cup ready? How England rated in Six Nations"
## [120] "Dominant England promise there is more to come"
## [121] "Edwards set for Wasps – if RFU don’t call"
## [122] "Fiery Bottas silences his critics with opening win"
## [123] "Puppeteer Sexton seems to have had his strings snipped"
## [124] "British rookies take team honours but no points"
## [125] "Jones sets England on their way to series victory over Sri Lanka"
## [126] "Ritchie goal sparks arrests"
## [127] "Azpilicueta: we cannot afford to prioritise the Europa League"
## [128] "Fulham should take heart from defensive improvements"
## [129] "Talking points: it is time we had a second penalty spot from 18 yards"
## [130] "Only 25 top-flight games – but more Wales caps than anyone"
## [131] "City need more from Mahrez if they are to land quadruple"
## [132] "‘Tired’ Guardiola goes on holiday"
## [133] "The ghost of Mourinho visits United"
## [134] "Maddison sends out Southgate reminder"
## [135] "Duggan on target before world-record club crowd"
## [136] "‘Sensational result,’ declares Klopp as his side make return to summit"
## [137] "England call up Ward-Prowse after three players pull out of squad"
## [138] "Arnautovic is shown the way by Hernández"
## [139] "Affection for ‘captain of the bench’ gives Gracia a dilemma"
## [140] "City show gulf in class and resources to reach last four"
## [141] "Wildcard Andreescu shocks Kerber to win Indian Wells title"
## [142] "Sarri: my team’s problems are mental"
## [143] "Banned duo Smith and Warner get ‘awesome’ welcome"
## [144] "Powell urges Castleford to shape up before St Helens"
## [145] "Champion tipster of the year Rob Wright’s racing tips"
## [146] "Game in numbers: VAR gives United’s Victor A Reprieve"
## [147] "Paul Hutchins"
## [148] "Pia, Lady Chelwood"
## [149] "Peter Hurford"
## [150] "Peter Block"
## [151] "Mine clearance service"
## [152] "March 16 & 17"
## [153] "Last vestiges of French empire in North America"
## [154] "Crossword Club"
## [155] "Times Concise No 7915"
## [156] "Times Quick Cryptic No 1310"
## [157] "Times Cryptic No 27301"
## [158] "Concise Quintagram No 326"
## [159] "Cryptic Quintagram No 326"
## [160] "Sudoku No 10570 Difficult"
## [161] "Sudoku No 10571 Fiendish"
## [162] "Sudoku No 10569 Easy"
## [163] "Killer Sudoku No 6489 Tricky"
## [164] "Killer Sudoku No 6488 Gentle"
## [165] "Brain Trainer No 2830"
## [166] "Cell Blocks No 3482"
## [167] "Codeword No 3599"
## [168] "Futoshiki No 3391"
## [169] "Kakuro No 2350"
## [170] "KenKen No 4591"
## [171] "Lexica No 4701"
## [172] "Lexica No 4702"
## [173] "Polygon"
## [174] "Set Square No 2353"
## [175] "Suko No 2500"
## [176] "Bridge"
## [177] "Chess"
## [178] "The British housewife of Beverly Hills: inside the world of Lisa Vanderpump"
## [179] "How to stop girls becoming jihadi brides"
## [180] "The head teacher who also cleans the school loos"
## [181] "I helped the wealthy get in to Oxford"
## [182] "Punk meets politics at the world’s weirdest music festival"
## [183] "Ask Tanya Byron: Should I leave my horrible family behind?"
## [184] "Kevin Maher: Man in forties in major sports shock. No, not James Cracknell — me!"
## [185] "The Times Daily Quiz"
## [186] "Ellie Goulding’s hen party"
## [187] "Billy Bishop Goes to War at the Southwark Playhouse, SE1"
## [188] "The Magic Flute at the London Coliseum"
## [189] "The Thread at Sadler’s Wells"
## [190] "RLPO/Petrenko at the Philharmonic Hall, Liverpool"
## [191] "Pussy Riot at Unit 8, SE15"
## [192] "TV review: Skeletons of the Mary Rose: The New Evidence; Midsomer Murders"
## [193] "What’s on TV tonight"
## [194] "Lindsey Bareham’s brown shrimp, dill and pea pasta"
## [195] "Ben is Back"
## [196] "Scotland stalls on appointment of first gay bishop"
## [197] "Funding for vulnerable pupils has shrunk by a quarter, charities say"
## [198] "Scottish rugby stars pitch their own coffee brewing business"
## [199] "Landowners feel heat over grouse moor burning"
## [200] "Holyrood could ‘legally’ hold fresh vote on independence"
## [201] "Doctors quit hospital over safety concerns, MSPs told"
## [202] "US politician grilled over film company"
## [203] "Lighthouse is repainted in wild weather"
## [204] "‘Spend more’ to reduce self-harming by prisoners"
## [205] "Rail firms get £20m for Easter travel hell"
## [206] "Tories call for law review of leniency test"
## [207] "Popularity of island crime drama is open and shut case"
## [208] "Heat maps reveal damp in Mackintosh treasure"
## [209] "Scientists stop spuds having their chips"
## [210] "Walkers go on warpath over Highland track"
## [211] "Use drug dogs at football, police urged"
## [212] "Oor Wullie, the musical"
## [213] "£1.7m grant lets Indian firm reopen Pinneys plant"
## [214] "Reshuffle for board at housebuilder"
## [215] "Apple, the Stones and secrets of those brands that won’t fade away"
## [216] "Scottish fiction has lost its fashionable swing"
## [217] "Laidlaw had a big say in fightback"
## [218] "McLeish blow as Robertson has to pull out"
## [219] "They made bad fouls — like rugby, says Katic"
## [220] "Higgins’s message of shared heritage sent around world"
## [221] "McDonald told to grow up over ‘England get out of Ireland’ banner"
## [222] "Allegations about FAI handed to politician"
## [223] "Destroying illicit goods cost Revenue more than €170,000"
## [224] "Brexit fear drives passport applications up by a third"
## [225] "Church institutions behind delays in payments for laundry survivors"
## [226] "Full Irish laid on for royal mini‑moon"
## [227] "Party event is axed after campaigner complaints"
## [228] "Urine DNA test for prostate cancer could prevent deaths"
## [229] "Brexit is an Orwell farce staged by amateurs"
## [230] "Grey is winning but I’ll go down fighting"
## [231] "Bus overhaul ‘will rip up our communities’"
## [232] "Actor ‘exported’ from Ireland plays patron saint in Galway pageant"
## [233] "Mayor rains on Limerick’s parade after hurling champions snubbed"
## [234] "A shining spring appearance"
## [235] "What did happen to the Likely Lads? Now we know"
## [236] "Taxman’s blind eye ‘eases way for smugglers’"
## [237] "Gain of €21m on extended Enet deal"
## [238] "Calculated Risk"
## [239] "School-starting age should not be a guessing game"
## [240] "Vodafone will expand 5G testing"
## [241] "Penalty point offences drop as drivers put brake on speed"
## [242] "‘Out of control’ board blocked hotel"
## [243] "University given go ahead for off-licence on campus"
## [244] "Schmidt: I’ll fix Ireland before Japan"
## [245] "In-form Byrne is given nod by McCarthy"
## [246] "Six months for Schmidt to sort it out"
## [247] "Ringrose: no confidence crisis"
## [248] "Even guy outside in the burger van has figured us out"
## [249] "Sexton, the coach . . . no one can be immune from Irish inquest"
## [250] "Corofin ease past Crokes to join the football pantheon"
## [251] "League exit turns focus to All-Ireland for Gavin and Dublin"
## [252] "Ruthless Waterford shoot lights out to hand Clare a drubbing"
## [253] "Shefflin delivers success at his first attempt"
## [254] "Ireland struggle after decisions go against them"
## [255] "Irish women are ‘punished’ for mistakes"
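To get from a vector of headlines to "structured form", the same selection can feed a data frame. A minimal sketch, using a made-up inline HTML snippet rather than the live page; the assumption that the `.Item-headline` class sits on a link element with an `href` is illustrative only.

```r
library(rvest)
library(tibble)

# Made-up HTML standing in for a downloaded page (no network needed)
page <- read_html(paste0(
  '<div><a class="Item-headline" href="/article/one">First headline</a></div>',
  '<div><a class="Item-headline" href="/article/two">Second headline</a></div>'
))

# Select the headline elements once, then pull out several fields
nodes <- html_nodes(page, ".Item-headline")

headlines <- tibble(
  headline = html_text(nodes),       # visible text
  link     = html_attr(nodes, "href") # value of the href attribute
)

print(headlines)
```

One row per headline makes the result easy to filter, join or write out with the rest of the Tidyverse.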
API: application programming interface
Structured way for programs to talk to each other
Lots of organisations provide them: private sector and public sector
Endpoint: “/Published/Notices/OCDS/Search”
Parameter: “stages=award”
Parameter: “order=ASC”
Parameter: “page=1”
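The endpoint and parameters above combine into a single request URL. A rough sketch with httr: the host name is an assumption (it is not given on the slide), and `fetch_page()` is a hypothetical helper that is defined but not called here.

```r
library(httr)

# Assumed host for the API; only the endpoint and parameters come from the slide
base_url <- "https://www.contractsfinder.service.gov.uk"

# Build the request URL from endpoint + query parameters
search_url <- modify_url(
  base_url,
  path  = "/Published/Notices/OCDS/Search",
  query = list(stages = "award", order = "ASC", page = 1)
)
print(search_url)

# Hypothetical helper: request one page and parse the JSON response
fetch_page <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # fail loudly on HTTP errors
  jsonlite::fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}
```

Changing `page` and calling the helper repeatedly is the usual way to walk through paginated results.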
Live demo! https://www.zap-map.com/live/
Derived from relational algebra; popularised by SQL databases
Think of an Excel VLOOKUP function
Basic but misunderstood…
Find matching rows between two tables based on a column
Create a new table with just the matching rows
Find matching rows between two tables based on a column
Stick the matches from table B on to the right-hand side of table A (or vice versa)
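The two kinds of join described above can be sketched with dplyr. The tables and their contents are made up for illustration.

```r
library(dplyr)

# Two made-up tables sharing a "company" column
table_a <- tibble(
  company  = c("Alpha Ltd", "Beta plc", "Gamma LLP"),
  turnover = c(10, 20, 30)
)
table_b <- tibble(
  company  = c("Alpha Ltd", "Gamma LLP", "Delta Ltd"),
  director = c("A. Smith", "C. Jones", "D. Patel")
)

# Inner join: a new table with just the rows that match in both
inner <- inner_join(table_a, table_b, by = "company")

# Left join: every row of table A, with matches from table B added
# on the right-hand side (NA where there is no match)
left <- left_join(table_a, table_b, by = "company")

print(inner)
print(left)
```

Here the inner join keeps two rows, while the left join keeps all three rows of table A and leaves the director blank for "Beta plc".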
Data entered by humans is messy
Louis Goddard / Louis Godard / L Goddard / Goddard, Louis / Mr Louis Goddard
Needs to be standardised before it can be joined
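Some of this variation can be removed with simple rules before any fuzzy matching. A minimal sketch with stringr; `standardise_name()` is a hypothetical helper and its list of titles is deliberately incomplete.

```r
library(stringr)

# Hypothetical helper: apply a few simple normalisation rules
standardise_name <- function(x) {
  x <- str_to_lower(str_squish(x))  # trim whitespace, lower-case
  # Flip "surname, forename" into "forename surname"
  x <- str_replace(x, "^([^,]+),\\s*(.+)$", "\\2 \\1")
  # Drop a leading title (illustrative list only)
  x <- str_remove(x, "^(mr|mrs|ms|dr)\\.?\\s+")
  x
}

names_raw <- c("Louis Goddard", "Louis Godard", "L Goddard",
               "Goddard, Louis", "Mr Louis Goddard")

names_clean <- standardise_name(names_raw)
print(names_clean)
```

Rules like these handle ordering and titles; the misspelling ("godard") and the initial ("l goddard") still need fuzzy matching or manual review.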
library(fuzzyjoin)

# Messy data: "Monaco" appears in three different spellings
tax_exiles <- data.frame(
  country = c("Monaco", "monaco", "Switzerland", "Monnaco"),
  year = c(2013, 2014, 2013, 2015),
  tax_exiles = c(56, 23, 245, 35))

# Clean reference list of country names
countries <- data.frame(
  country = c("Monaco", "Switzerland"))

# Join on approximate matches (string distance of up to 2 by default)
stringdist_inner_join(countries, tax_exiles, by = "country")
##     country.x   country.y year tax_exiles
## 1      Monaco      Monaco 2013         56
## 2      Monaco      monaco 2014         23
## 3      Monaco     Monnaco 2015         35
## 4 Switzerland Switzerland 2013        245
Difficult to write stories based on statistics: they are hard to grasp intuitively
Issues of trust. What constitutes an outlier?
Visualisation can help, but only so much
Database of blood test results leaked to Insight team and ARD
Score generated based on consultation with two expert sources
Suspicious results for medal-winners sent to experts for confirmation
Huge opportunities: not many data journalists do it well!
Not all about mapping. Useful for amalgamation of data sources, e.g. how many Xs are there in Y area?
Benefits from using reproducible workflows rather than graphical GIS software
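The "how many Xs are there in Y area?" question can be answered in a few reproducible lines. A minimal sketch with the sf package, using made-up square areas and points with no real coordinate system; `square()` is a hypothetical helper.

```r
library(sf)

# Hypothetical helper: a square polygon with its corner at (x0, y0)
square <- function(x0, y0, size = 1) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + size, y0), c(x0 + size, y0 + size),
    c(x0, y0 + size), c(x0, y0)  # ring must close on its first point
  )))
}

# Two made-up areas (the Ys)
areas <- st_sf(
  area     = c("A", "B"),
  geometry = st_sfc(square(0, 0), square(2, 0))
)

# Four made-up point locations (the Xs)
sites <- st_as_sf(
  data.frame(id = 1:4, x = c(0.2, 0.8, 2.5, 5), y = c(0.5, 0.5, 0.5, 5)),
  coords = c("x", "y")
)

# For each area, count the points that fall inside it
areas$n_sites <- lengths(st_intersects(areas, sites))
print(st_drop_geometry(areas))
```

With real data the polygons would come from a shapefile via `st_read()`, and both layers would need the same coordinate reference system before being compared.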
Shapefiles of licence areas from Oil and Gas Authority
Data on commercial and corporate land ownership from Land Registry
Postcode location data from Ordnance Survey
Developed as a statistical language in the ’90s, based on S
A ‘scripting language’: used for quickly arranging workflows rather than building software
Very modular and extensible – can be adapted for different purposes
Data journalists and academics vs. developers
Tidyverse vs. Pandas
Vectors and pipes vs. loops (and other stylistic differences)
Jupyter vs. RMarkdown
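The "vectors and pipes vs. loops" contrast, sketched in R itself with made-up numbers: the same calculation written as an explicit loop and as a single vectorised, piped call.

```r
library(magrittr)  # provides the %>% pipe

x <- c(1, 4, 9, 16)

# Loop style: fill in the result one element at a time
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

# Vector style: sqrt() works on the whole vector at once
roots_vec <- x %>% sqrt()

# Both approaches give the same answer
print(identical(roots_loop, roots_vec))
```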
Lots of non-programmers = lots of support available online
Stack Overflow – Q&A website
#rstats hashtag on Twitter
RStudio Community forum
Integrated development environment (IDE)
Basically a text editor with bells and whistles
Not essential for working with R, but makes it much easier!
Also the name of the company that makes RStudio
Employs lots of R package developers and maintainers, particularly with Tidyverse packages
A force for good in the R ecosystem! For now, anyway…
A collection of packages that work together to make data science (and data journalism) easier
Key packages: readr (reading and writing data), dplyr (transforming data), tidyr (tidying data), ggplot2 (visualisation)
r4ds.had.co.nz
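A rough sketch of how three of these packages fit together, using a tiny made-up CSV supplied inline (the `I()` wrapper for literal data assumes readr 2.0 or later).

```r
library(readr)
library(dplyr)
library(tidyr)

# readr: read a small made-up CSV, one column per year
raw <- read_csv(
  I("region,2017,2018\nNorth,10,12\nSouth,20,18\n"),
  show_col_types = FALSE
)

# tidyr: reshape to one row per region and year
tidy <- raw %>%
  pivot_longer(-region, names_to = "year", values_to = "value") %>%
  mutate(year = as.integer(year))

# dplyr: summarise by year
totals <- tidy %>%
  group_by(year) %>%
  summarise(total = sum(value))

print(totals)
```

The tidy, long-format table is also the shape ggplot2 expects, so a chart is typically one more step on the same pipeline.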
‘Literate programming’ – code and explanation flow seamlessly together
Plain-text, human-readable format – an advantage over Jupyter notebooks
Output to many different formats: PDF reports, web pages, even slides!
Free company data product
Simple CSV with details on every active company
Name, address, SIC codes, accounts data, etc.
Persons with significant control (PSC) snapshot
Big NDJSON file showing people who control UK companies
Needs a bit of processing, but extremely useful
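NDJSON is one JSON object per line, so the "bit of processing" is mostly reading the lines and flattening nested fields. A minimal sketch with jsonlite on two made-up records; the field names are simplified assumptions, not the full schema of the real file.

```r
library(jsonlite)

# Two made-up records, one JSON object per line
ndjson <- c(
  '{"company_number":"00000001","data":{"name":"Person A","kind":"individual"}}',
  '{"company_number":"00000002","data":{"name":"Company B","kind":"corporate"}}'
)

# stream_in() reads NDJSON line by line from a connection
con <- textConnection(ndjson)
psc <- stream_in(con, verbose = FALSE)
close(con)

# Nested objects arrive as data frame columns; flatten them out
psc_flat <- jsonlite::flatten(psc)
print(names(psc_flat))
```

For the real snapshot, `file("path/to/snapshot.txt")` (a placeholder path) would replace the text connection, and `stream_in()` can process the file in chunks through its `handler` argument rather than loading everything at once.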
Commercial and corporate ownership data (CCOD)
Land in England and Wales owned by companies and corporate entities (e.g. government, the Church)
Doesn’t include reliable geolocation information 😭 (not all addresses have postcodes)
Overseas companies ownership data (OCOD)
Cabinet Office open contracting data
API giving details of all contract tenders and awards
Tussell sells this for £700/month – and it’s all free!
Energy Performance Certificate data
Food hygiene rating data
Think laterally!